
Unoptimized library implementation causing CUDA API slow. #165

Open
nishitnshah wants to merge 10 commits into Project-HAMi:main from nishitnshah:perf/reduce-hijack-overhead

Conversation

@nishitnshah
Contributor

Every LOG_DEBUG/LOG_INFO/LOG_WARN/LOG_MSG macro was calling
getenv("LIBCUDA_LOG_LEVEL") and atoi() on every invocation. On hot
paths (kernel launch, memory alloc/free), this added measurable
overhead from repeated linear scans of the environment block.

Changes:
- Add g_log_level global (default 2 = warn, matching original behavior
  when LIBCUDA_LOG_LEVEL is unset)
- Add log_utils_init() to read the env var once at startup
- Rewrite all LOG_* macros to check g_log_level instead of getenv()
- Call log_utils_init() from preInit() in libvgpu.c

The log level can still be controlled via LIBCUDA_LOG_LEVEL env var;
it is simply read once at library load time instead of on every log
statement. LOG_ERROR remains unconditional (always emitted).
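A minimal sketch of the pattern described above. Only `g_log_level`, `log_utils_init()`, and the default of 2 come from this commit; the macro bodies, the level-to-severity mapping, and the output format are assumptions for illustration:

```c
#include <stdio.h>
#include <stdlib.h>

/* Read once at library load; 2 (warn) matches the old default when
 * LIBCUDA_LOG_LEVEL is unset. */
int g_log_level = 2;

void log_utils_init(void) {
    const char *lvl = getenv("LIBCUDA_LOG_LEVEL");
    if (lvl != NULL)
        g_log_level = atoi(lvl);
}

/* Hot-path macros now test a plain global instead of calling
 * getenv()/atoi() on every invocation. */
#define LOG_WARN(fmt, ...) \
    do { if (g_log_level >= 2) fprintf(stderr, "[warn] " fmt "\n", ##__VA_ARGS__); } while (0)
#define LOG_INFO(fmt, ...) \
    do { if (g_log_level >= 3) fprintf(stderr, "[info] " fmt "\n", ##__VA_ARGS__); } while (0)
/* LOG_ERROR stays unconditional, as before. */
#define LOG_ERROR(fmt, ...) \
    do { fprintf(stderr, "[error] " fmt "\n", ##__VA_ARGS__); } while (0)
```

`log_utils_init()` is invoked once from `preInit()`, so every later `LOG_*` expansion costs one global read and a compare instead of an environment-block scan.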

Signed-off-by: nishitnshah <nishshah@linkedin.com>
wait_status_self() is called via the ENSURE_RUNNING() macro on every
kernel launch and memory operation. It was doing a linear scan through
all process slots (up to 1024) comparing PIDs to find the current
process's status — O(n) per call.

The process slot pointer is already cached in region_info.my_slot
during init_proc_slot_withlock(). Use it directly for an O(1) fast
path. The linear scan is preserved as a fallback for the edge case
where my_slot has not yet been initialized.

Also switched to proper atomic loads with acquire semantics for
reading shared-memory fields in the slow path, consistent with the
rest of the codebase.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
…ocking in pre_launch_kernel()

pre_launch_kernel() is called on every cuLaunchKernel invocation.
It was calling time(NULL) — a syscall — and always acquiring
pthread_mutex_lock even when the timestamp hadn't changed.

Changes:
- Replace time(NULL) with clock_gettime(CLOCK_REALTIME_COARSE), which
  is served from the Linux vDSO (no syscall, ~1-4ms resolution).
  This is a safe drop-in: the recording interval is 1 second, so
  millisecond jitter is irrelevant. CLOCK_REALTIME_COARSE gives epoch
  time like time(), so dump_shrreg and other consumers are unaffected.
- Add double-checked locking: check the timestamp before acquiring the
  mutex. On the fast path (>99.99% of calls, since kernels fire
  thousands/sec but interval is 1s), this becomes a single memory
  read + integer comparison — no syscall, no mutex.
- The unlocked read of region_info.last_kernel_time is safe: uint64_t
  reads are atomic on x86-64 and aarch64 (aligned), and a torn read
  would at worst cause one extra mutex acquisition.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
…t shared memory ops

rate_limiter() is called on every kernel launch when pidfound==1.
It was performing 3 shared memory reads, 1 shared memory write, and
2 ensure_initialized() calls before reaching the actual rate-limiting
CAS loop — all unnecessary per-call overhead.

Changes:
- Cache sm_limit and utilization_switch in static locals during
  init_utilization_watcher(). These values are set at container
  startup and do not change at runtime.
- Use cached values for the fast-exit check instead of reading from
  shared memory on every call. When limiting is inactive (sm_limit
  >= 100 or == 0), rate_limiter becomes a single branch on a local
  variable.
- Remove set_recent_kernel(2) — it unconditionally wrote 2 to shared
  memory, but the value was already 2 (set at init and never changed
  to anything else). This dirtied a cross-process cache line on every
  kernel launch for no effect.
- Remove duplicate get_current_device_sm_limit(0) call (was called
  twice with identical arguments).
- Reduce sleep(1) to usleep(1000) in the defensive recent_kernel
  guard (unreachable in current codebase, but safer if triggered).
- The actual CAS spin loop and 10ms nanosleep backoff are unchanged,
  preserving correct rate-limiting behavior when 0 < sm_limit < 100.
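The caching and fast-exit portion can be sketched as follows. The function and variable names here are illustrative stand-ins (only `sm_limit`, `utilization_switch`, and the `>= 100 or == 0` inactive condition come from the commit), and the CAS loop itself is elided since it is unchanged:

```c
/* Values fixed at container startup; cached once in
 * init_utilization_watcher() so the hot path never touches shared
 * memory or ensure_initialized() for them. */
static int cached_sm_limit;
static int cached_util_switch;

void init_watcher_sketch(int sm_limit, int util_switch) {
    cached_sm_limit = sm_limit;
    cached_util_switch = util_switch;
}

/* Returns 1 if the CAS rate-limiting loop should run, 0 to fast-exit. */
int rate_limiter_should_throttle(void) {
    /* Limiting inactive: a single branch on process-local statics. */
    if (!cached_util_switch || cached_sm_limit == 0 || cached_sm_limit >= 100)
        return 0;
    return 1;   /* proceed to the (unchanged) CAS spin loop */
}
```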

Signed-off-by: nishitnshah <nishshah@linkedin.com>
oom_check() called cuDeviceGetCount() on every memory allocation,
but never used the result — the variable count1 was written to and
then discarded. This was a wasted driver API call on every alloc.

Remove the dead call entirely. The device count does not change at
runtime and is not needed by this function's logic, which only
operates on the specific device passed via the dev parameter.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
Document all P0 and P1 changes for reducing CUDA hijacking overhead:
- Problem description, fix rationale, behavior preservation notes,
  and testing guidance for each commit.
- Expected impact summary table.
- Benchmarking instructions.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
g_log_level, fp1, and log_utils_init() were defined in utils.c, which
is only compiled into libvgpu.so. The shrreg-tool executable links
multiprocess_mod (which uses LOG_* macros referencing g_log_level)
but does not link utils.o, causing undefined reference errors.

Fix: extract these definitions into a standalone log_utils.c and add
it to both the main libvgpu library and the shrreg-tool executable
in the CMake build files.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
The previous optimization moved the sm_limit fast-exit before the
get_recent_kernel()/set_recent_kernel(2) calls, which meant the
"GPU is active" signal was no longer written when SM limiting was
disabled. While current code only reads recent_kernel within
rate_limiter itself, the shared memory field could be observed by
external tooling or future features for OOM decisions or memory
accounting.

Restore the original call order: recent_kernel read/write happens
unconditionally on every call, then the cached sm_limit/util_switch
check determines whether to proceed to the CAS rate-limiting loop.

The remaining optimizations are preserved:
- Cached sm_limit/utilization_switch (eliminates 3 shared memory
  reads + 2 ensure_initialized calls)
- Reduced sleep(1) to usleep(1000) in the defensive guard
- Removed duplicate get_current_device_sm_limit(0) call

Signed-off-by: nishitnshah <nishshah@linkedin.com>
…er()"

This reverts commit bfea6e1.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
… behavior

Restore the unconditional set_recent_kernel(2) call that was removed
in the rate_limiter optimization. The write has negligible cost (~100-
200ns cache line store) compared to the other savings in this function,
and removing it changes observable shared memory state which could
affect external tooling or future features.

The call is placed after the cached sm_limit/util_switch fast-exit,
matching the original position relative to the get_recent_kernel()
guard. All other optimizations (cached limits, removed duplicate
sm_limit call, reduced sleep) are preserved.

Signed-off-by: nishitnshah <nishshah@linkedin.com>
@hami-robot
Contributor

hami-robot bot commented Mar 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nishitnshah
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot bot added the size/L label Mar 25, 2026
@nishitnshah nishitnshah force-pushed the perf/reduce-hijack-overhead branch from 81e5172 to 9c1c6ff on March 25, 2026 17:11
@hami-robot
Contributor

hami-robot bot commented Mar 25, 2026

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

  • 795b54d perf: cache log level to eliminate per-call getenv()/atoi() overhead
  • 4430c4b perf: use cached my_slot pointer in wait_status_self()
  • df4bf32 perf: replace time() syscall with vDSO clock and add double-checked locking in pre_launch_kernel()
  • f8c9189 perf: optimize rate_limiter() by caching limits and removing redundant shared memory ops
  • 810a185 perf: remove dead cuDeviceGetCount call from oom_check()
  • 52e91a5 docs: add performance optimization documentation
  • 6557e3a fix: move log level globals to separate log_utils.c for link visibility
  • 9040fec fix: restore unconditional set_recent_kernel(2) in rate_limiter()
  • 0a28f79 Revert "fix: restore unconditional set_recent_kernel(2) in rate_limiter()"
  • 9c1c6ff fix: restore set_recent_kernel(2) in rate_limiter() to match original behavior

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
